Abstract
Background Rigorously benchmarking methods for exponential growth detection requires metagenomic datasets where the ground truth is known. Simulation is a flexible tool that can meet this need by enabling generation of datasets from known data-generating processes. Furthermore, thinking through how one might realistically simulate data is a useful step in building understanding of the real data.
Task We aim to specify a hierarchical model for generation of metagenomic time series data, under (1) a baseline behaviour regime, and (2) an exponential growth regime. We expect the model to include modules for the behaviour of organisms under each regime, the relationship between organisms and k-mers, and the metagenomic sequencing observation process.
Findings This work is ongoing, and we believe that the task as specified is still of value. We preliminarily find that specifying an appropriate model for baseline organism behaviour requires some thought, and in particular that the lognormal geometric random walk produces some undesirable features in the data.
Next steps The next steps are to (1) build an understanding of and incorporate models for sequencing, (2) think of simple ways to prevent the issue observed with random walks, and (3) save some simulated datasets ready for trying inference.
n_organism <- 10                  # total number of organisms
baseline_indices <- 2:n_organism  # organisms following baseline behaviour
exponential_indices <- 1          # organism following exponential growth
n_bp <- 100                       # genome length (base-pairs)
k <- 40                           # k-mer length
n_kmer <- n_bp - k                # k-mers per organism (a simplification of n_bp - k + 1)
time_window <- 14                 # days of observation
baseline <- 100                   # baseline concentration
exp_initial <- 1                  # initial concentration in the exponential regime
r <- 0.4                          # exponential growth rate
read_depth <- 10^4                # reads per day
Suppose that there are 10 organisms in total in the sample, and nothing else. Each organism has a genome that is exactly 100 base-pairs long. Each genome is broken into \(k\)-mers of length 40. We take there to be 60 \(k\)-mers from each organism, obtained by subtracting the \(k\)-mer length from the genome size (strictly there are \(100 - 40 + 1 = 61\) overlapping \(k\)-mers; we use \(n_\text{bp} - k = 60\) as a simplification). Assume that each \(k\)-mer is unique, and there are no sequencing errors. We collect data over a period of 14 days \(t = 0, 1, \dots\) at a single environmental monitoring site.
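To make the \(k\)-mer decomposition concrete, here is a toy sketch (using a short example sequence and \(k = 3\) rather than the real parameters) of how a genome is broken into overlapping \(k\)-mers:

```r
# Break a sequence into its overlapping k-mers of length k
get_kmers <- function(seq, k) {
  n <- nchar(seq)
  sapply(1:(n - k + 1), function(i) substr(seq, i, i + k - 1))
}

get_kmers("ACGTAC", 3)          # "ACG" "CGT" "GTA" "TAC"
length(get_kmers("ACGTAC", 3))  # n_bp - k + 1 = 4; the simulation rounds this to n_bp - k
```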
We consider two regimes: (1) baseline and (2) exponential growth.
In the baseline regime, a species has a concentration \(c^{(b)}\) in water of 100 copies per \(\mu\)L (the units don't matter much), independent of the day. \[ c^{(b)}_t = c^{(b)} \] In the exponential growth regime, a species starts at a concentration \(c^{(e)}_0\), and over the 14 day window its concentration increases exponentially according to \[ c^{(e)}_t = c^{(e)}_0 \exp(rt) \] where \(r\) is the growth rate, which we take to be 0.4, and \(c^{(e)}_0\) is the initial exponential concentration, which we take to be 1 (lower than the baseline concentration, as we assume the exponentially increasing organism to be novel).
The true concentration of the baseline organisms over the time window is:
conc_baseline <- rep(baseline, time_window)
conc_baseline
## [1] 100 100 100 100 100 100 100 100 100 100 100 100 100 100
And the true concentration of the exponentially increasing organism over the time window is:
conc_exponential <- exp_initial * exp(r * 0:13)
conc_exponential
## [1] 1.000000 1.491825 2.225541 3.320117 4.953032 7.389056 11.023176 16.444647 24.532530 36.598234 54.598150
## [12] 81.450869 121.510418 181.272242
Suppose that organisms 2 through 10 follow the baseline behaviour and organism 1 follows the exponential behaviour. We will represent the true concentrations of each organism with a matrix C:
C <- matrix(
  data = NA,
  nrow = time_window,
  ncol = n_organism
)
C[, exponential_indices] <- conc_exponential
C[, baseline_indices] <- conc_baseline
C
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1.000000 100 100 100 100 100 100 100 100 100
## [2,] 1.491825 100 100 100 100 100 100 100 100 100
## [3,] 2.225541 100 100 100 100 100 100 100 100 100
## [4,] 3.320117 100 100 100 100 100 100 100 100 100
## [5,] 4.953032 100 100 100 100 100 100 100 100 100
## [6,] 7.389056 100 100 100 100 100 100 100 100 100
## [7,] 11.023176 100 100 100 100 100 100 100 100 100
## [8,] 16.444647 100 100 100 100 100 100 100 100 100
## [9,] 24.532530 100 100 100 100 100 100 100 100 100
## [10,] 36.598234 100 100 100 100 100 100 100 100 100
## [11,] 54.598150 100 100 100 100 100 100 100 100 100
## [12,] 81.450869 100 100 100 100 100 100 100 100 100
## [13,] 121.510418 100 100 100 100 100 100 100 100 100
## [14,] 181.272242 100 100 100 100 100 100 100 100 100
We assume that the true concentration of each \(k\)-mer is that of the corresponding organism. This may be represented by copying each column of the matrix C 60 times, once per \(k\)-mer.
K <- matrix(rep(as.numeric(t(C)), each = n_kmer), nrow = time_window, byrow = TRUE)
The matrix K now has 14 rows, one for each day, and 600 columns, one for each \(k\)-mer (of which there are 60 times 10). We can calculate proportions, where the total number of \(k\)-mers is given by the row sums:
K_norm <- K / rowSums(K)  # divide each row (day) by its total
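As a quick sanity check on this construction, here is a standalone sketch that rebuilds C, K, and K_norm from the parameter values above and asserts the properties we expect:

```r
# Rebuild the concentration and k-mer matrices from the parameters above
time_window <- 14; n_organism <- 10; n_kmer <- 60
C <- matrix(100, nrow = time_window, ncol = n_organism)
C[, 1] <- exp(0.4 * 0:13)
K <- matrix(rep(as.numeric(t(C)), each = n_kmer), nrow = time_window, byrow = TRUE)
K_norm <- K / rowSums(K)

stopifnot(all(dim(K) == c(14, 600)))              # 14 days by 600 k-mers
stopifnot(all(abs(rowSums(K_norm) - 1) < 1e-12))  # each row is a probability vector
stopifnot(all(K[, 1:n_kmer] == C[, 1]))           # k-mers inherit their organism's concentration
```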
One way to represent the sequencing process is as a sample from the collection of \(k\)-mers. For example, we could consider a multinomial sample with probabilities of sampling each \(k\)-mer given by K_norm and sample size given by the read depth 10000.
To demonstrate this, suppose we simulate the sequencing process on day 1. The proportions of each \(k\)-mer are given by K_norm[1, ], and we may sample using rmultinom. A histogram of the \(k\)-mer counts at day 1 under each regime, showing that the exponential regime is initialised at low count numbers, is given by:
sample_one <- rmultinom(1, read_depth, K_norm[1, ])
get_regime <- function(organism_index) {
  if (organism_index %in% exponential_indices) {
    "exponential"
  } else if (organism_index %in% baseline_indices) {
    "baseline"
  } else {
    NA_character_
  }
}

# Test that this function works as intended
stopifnot(purrr::map_chr(1:2, get_regime) == c("exponential", "baseline"))
data.frame(count = sample_one) %>%
  mutate(
    organism = rep(1:n_organism, each = n_kmer),
    regime = str_to_title(purrr::map_chr(organism, get_regime))
  ) %>%
  ggplot(aes(x = count, group = regime, fill = regime)) +
  geom_histogram(alpha = 0.8) +
  labs(x = "k-mer count", y = "Occurrences", fill = "Regime", title = "k-mer counts at day 1") +
  scale_fill_manual(values = cbpalette) +
  theme_minimal()
Now, we will take multinomial samples from all of the days with a call to apply:
sample <- apply(K_norm, 1, function(row) rmultinom(1, read_depth, row))
colnames(sample) <- paste0("day", 1:ncol(sample))
rownames(sample) <- paste0(1:nrow(sample))
sample_df <- sample %>%
  as.data.frame() %>%
  tibble::rownames_to_column("id") %>%
  pivot_longer(
    cols = starts_with("day"),
    names_to = "day",
    values_to = "count",
    names_prefix = "day"
  ) %>%
  mutate(
    id = as.numeric(id),
    day = as.numeric(day),
    kmer = rep(rep(1:n_kmer, each = time_window), times = n_organism),
    organism = rep(1:n_organism, each = n_kmer * time_window),
    regime = str_to_title(purrr::map_chr(organism, get_regime))
  )
saveRDS(sample_df, "sample0.rds")
The data frame sample_df can be filtered to a single trajectory, for example the first \(k\)-mer from the first organism, to inspect the counts for one \(k\)-mer over the 14 days.
Let’s plot the data from organism 1 which we have set to be exponentially increasing, together with the data from all other organisms which we have set to follow a baseline distribution:
sample_summary <- sample_df %>%
  group_by(day, regime) %>%
  summarise(
    count_upper = quantile(count, 0.95),
    count_median = median(count),
    count_lower = quantile(count, 0.05)
  )
ggplot(sample_summary, aes(x = day, ymin = count_lower, y = count_median, ymax = count_upper, group = regime)) +
  geom_ribbon(alpha = 0.1, aes(fill = regime)) +
  geom_line(aes(col = regime), size = 1.5) +
  geom_line(
    data = sample_df, aes(x = day, y = count, col = regime, group = id),
    alpha = 0.025, inherit.aes = FALSE
  ) +
  theme_minimal() +
  scale_color_manual(values = cbpalette) +
  scale_fill_manual(values = cbpalette) +
  guides(fill = "none") +
  labs(x = "Day", y = "Number of reads in sample", col = "Regime")
pickout_day <- 10
With these settings, on day 10 each \(k\)-mer from organism 1 is expected to be sampled about 6.5 times, and \(k\)-mers from organism 1 are expected to represent about 3.9% of the total reads.
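The expected day-10 read share of organism 1 can be computed directly from the model parameters; a standalone sketch of that calculation:

```r
# Expected proportion of day-10 reads (t = 9) coming from organism 1, using the
# parameter values above: r = 0.4, baseline = 100, 9 baseline organisms
c_exp <- 1 * exp(0.4 * 9)            # concentration of organism 1, about 36.6
p_org1 <- c_exp / (c_exp + 9 * 100)  # share of total concentration, about 0.039
expected_count <- 1e4 * p_org1 / 60  # expected reads per organism-1 k-mer, about 6.5
```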
Now, we will start to make this simulation more realistic by adding sources of noise.
In reality, the organism abundances will vary, neither being fixed to a constant or increasing deterministically. We will now suppose that the abundances vary stochastically as follows.
For the baseline concentration, one possible model is a lognormal geometric random walk such that \[ c^{(b)}_t = c^{(b)}_{t - 1} \times \exp \left( \epsilon^{(b)}_t \right) \] where \(\epsilon^{(b)}_t \sim \mathcal{N}(0, \sigma_b^2)\). Another way to calculate the baseline concentration at time \(t\), \(c^{(b)}_t\), is as \(c^{(b)}_t = c^{(b)}_{0} \times \exp \left( x^{(b)}_t \right)\) where \(x^{(b)}_t = \sum_{s = 1}^t \epsilon^{(b)}_s\) follows a random walk. The expected value and variance of this random walk are \[ \mathbb{E}[x^{(b)}_t] = \mathbb{E} \left[ \sum_{s = 1}^t \epsilon^{(b)}_s \right] = \sum_{s = 1}^t \mathbb{E} \left[ \epsilon^{(b)}_s \right] = 0, \\ \mathbb{Var} [ x^{(b)}_t ] = \mathbb{Var} \left[ \sum_{s = 1}^t \epsilon^{(b)}_s \right] = \sum_{s = 1}^t \mathbb{Var} \left[ \epsilon^{(b)}_s \right] = t\sigma_b^2. \] We can use the formula for the moment generating function of a Gaussian random variable \(X \sim \mathcal{N}(\mu, \sigma^2)\), \(m_X(u) = \mathbb{E} \left[ e^{uX} \right] = \exp(\mu u + \sigma^2u^2 / 2)\), applied with \(u = 1\) and \(x^{(b)}_t \sim \mathcal{N}(0, t\sigma_b^2)\), to calculate that the expected concentration at time \(t\) under this model is \[ \mathbb{E} \left[ c^{(b)}_t \right] = c^{(b)}_{0} \exp(t\sigma_b^2/2) > c^{(b)}_{0}. \]
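A quick Monte Carlo check of this expectation, as a standalone sketch (the choices \(\sigma_b = 0.5\) and the simulation size are arbitrary):

```r
# Check E[c_t] = c_0 * exp(t * sigma_b^2 / 2) by simulation
set.seed(1)
c0 <- 100; sigma_b <- 0.5; t_end <- 14; n_sims <- 1e5
x_t <- rnorm(n_sims, mean = 0, sd = sigma_b * sqrt(t_end))  # x_t ~ N(0, t * sigma_b^2)
mean(c0 * exp(x_t))              # Monte Carlo estimate of the expected concentration
c0 * exp(t_end * sigma_b^2 / 2)  # theoretical value, about 575.5
```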
n_baseline_paths <- 8
To show how this behaviour looks, let's simulate 8 baseline paths, taking \(\sigma_b = 1\).
baseline_df <- data.frame(
  day = rep(1:time_window, times = n_baseline_paths),
  epsilon = rnorm(time_window * n_baseline_paths),
  path_number = as.factor(rep(1:n_baseline_paths, each = time_window))
) %>%
  group_by(path_number) %>%
  arrange(path_number) %>%
  mutate(
    x = cumsum(epsilon),
    conc = baseline * exp(x)
  ) %>%
  ungroup()
time_series_plot <- function(df, title) {
  df %>%
    pivot_longer(
      cols = c("epsilon", "x", "conc"),
      names_to = "variable",
      values_to = "value"
    ) %>%
    mutate(
      variable = recode_factor(variable,
        "epsilon" = "epsilon",
        "x" = "x",
        "conc" = "Concentration"
      )
    ) %>%
    ggplot(aes(x = day, y = value, group = path_number, col = path_number)) +
    geom_line() +
    facet_wrap(~variable, ncol = 1, scales = "free") +
    scale_colour_manual(values = cbpalette) +
    theme_minimal() +
    labs(x = "Day", y = "", col = "Path number", title = title)
}
time_series_plot(baseline_df, "Baseline lognormal geometric random walk behaviour")
The key takeaway about the lognormal geometric random walk model is that even though the noise is IID Gaussian, taking the cumulative sum and exponentiating can lead to large deviations in concentration; this follows from the variance of a random walk increasing linearly in time. The maximum concentration value observed was 32262.8, which is 322.6 times the baseline concentration.
For the exponential regime, we could also suppose a lognormal geometric random walk \[ c^{(e)}_t = c^{(e)}_{t - 1} \times \exp \left( \epsilon^{(e)}_t \right) \] where \(\epsilon^{(e)}_t \sim \mathcal{N}(r, \sigma_e^2)\) and the growth rate \(r > 0\); by the same moment generating function argument as above, the expected concentration at time \(t\) under this model is \(\mathbb{E} \left[ c^{(e)}_t \right] = c^{(e)}_{0} \exp(rt + t\sigma_e^2/2)\).
Let’s simulate some paths from this distribution as before.
exponential_df <- data.frame(
  day = rep(1:time_window, times = n_baseline_paths),
  epsilon = rnorm(time_window * n_baseline_paths, mean = r),
  path_number = as.factor(rep(1:n_baseline_paths, each = time_window))
) %>%
  group_by(path_number) %>%
  arrange(path_number) %>%
  mutate(
    x = cumsum(epsilon),
    conc = exp_initial * exp(x)  # starts from the initial exponential concentration
  ) %>%
  ungroup()
time_series_plot(exponential_df, "Exponential lognormal geometric random walk behaviour")
To remedy the issue observed above, we could use an auto-regressive process for \(x^{(b)}_t\) rather than a random walk. Such a model is specified recursively by \[ x^{(b)}_1 \sim \mathcal{N} \left( 0, \frac{1}{1 - \rho^2} \right), \\ x^{(b)}_t = \rho x^{(b)}_{t - 1} + \epsilon^{(b)}_t, \quad t = 2, \ldots, T, \] where the lag-one correlation parameter \(\rho \in (0, 1)\) determines the extent of autocorrelation, and \(\epsilon^{(b)}_t \sim \mathcal{N}(0, 1)\) as before.
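This recursion can also be simulated directly in base R; a minimal standalone sketch, assuming \(\rho = 0.75\) and unit innovation variance:

```r
# Simulate a stationary AR(1) path directly from its recursion
simulate_ar1 <- function(n_steps, rho) {
  x <- numeric(n_steps)
  x[1] <- rnorm(1, mean = 0, sd = sqrt(1 / (1 - rho^2)))  # stationary initial distribution
  for (t in 2:n_steps) x[t] <- rho * x[t - 1] + rnorm(1)
  x
}

set.seed(1)
paths <- replicate(1000, simulate_ar1(14, 0.75))
var(paths[14, ])  # close to the stationary variance 1 / (1 - 0.75^2), about 2.29
```

Unlike the random walk, the variance at the final time step stays at the stationary value rather than growing linearly in \(t\).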
baseline_df <- data.frame(
  day = rep(1:time_window, times = n_baseline_paths),
  epsilon = rnorm(time_window * n_baseline_paths),
  path_number = as.factor(rep(1:n_baseline_paths, each = time_window))
) %>%
  group_by(path_number) %>%
  arrange(path_number) %>%
  mutate(
    x = as.numeric(stats::arima.sim(model = list(order = c(1, 0, 0), ar = 0.75), n = time_window, innov = epsilon)),
    conc = baseline * exp(x)
  ) %>%
  ungroup()
time_series_plot(baseline_df, "Baseline lognormal geometric auto-regressive behaviour")
See Townes (2020) for a nice review of models for count data.
Rather than assuming that during sequencing the probabilities of sampling each \(k\)-mer are given by the true proportions of the \(k\)-mers, we might want to add additional noise. One way to do this is to sample the proportions \(w\) from a probability distribution over the simplex, such as the Dirichlet. Given a vector of positive real numbers \(\alpha = (\alpha_1, \ldots, \alpha_K)\), we say \(w \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_K)\) if \[ p(w; \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} w_1^{\alpha_1 - 1} w_2^{\alpha_2 - 1} \times \cdots \times w_K^{\alpha_K - 1}. \] The expected value of \(w_i\) is \(\alpha_i / \sum_{k = 1}^K \alpha_k\).
dirichlet_draw <- gtools::rdirichlet(n = 1, alpha = K_norm[1, ])
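Note that with alpha = K_norm[1, ] the parameters sum to 1, which gives extremely noisy draws. Scaling alpha by a concentration parameter \(s\) (our notation, not defined elsewhere in this document) keeps the mean proportions fixed while controlling the noise. A standalone base-R sketch using the Gamma representation of the Dirichlet (which is what rdirichlet does internally):

```r
# A Dirichlet draw is a vector of Gamma draws normalised to sum to one
rdirichlet_gamma <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(1)
p <- rep(1 / 600, 600)                # true proportions, uniform for illustration
w_noisy <- rdirichlet_gamma(1 * p)    # s = 1: very noisy, mass piles onto a few k-mers
w_tight <- rdirichlet_gamma(1e4 * p)  # s = 10^4: draws concentrate near p
```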